ABSTRACT

A "Fundamentals of Speech Communication" course is
required of all college students, and upon completion of such a
course students should possess those basic speaking and listening
skills necessary to complete successfully their college educations.
With a view toward developing a new, more effective listening test, a
study examined research in listening test development. The study also
explained how the University of Wisconsin-Oshkosh Listening Test was
developed and reported what was learned regarding the properties and
performance of the test. The validity of the test was assessed using
three procedures: the content procedure, the predictive procedure,
and the known-groups method. Validity was also promoted by
implementing suggestions and findings reported by listening
assessment theorists. Two typical kinds of reliability tests were
conducted on the test: test-retest and the Kuder-Richardson #20. Bias
in regard to gender did not appear to exist in the test. No claims
can be made concerning race bias since few minority students have
taken the test. The test was subjected to item analysis to check for
difficulty and discriminating power. The test can be administered
successfully to up to 30 students in a single classroom by use of a
one-half inch VHS video playback unit and monitor. Of 916 students
tested, scores ranged from a low of 15 to a high of 52. The mean
score was 34.77 with a standard deviation of 5.55. (A profile of the
questions on the test and 24 references are attached.) (RS)
THE VALIDATION OF A LISTENING TEST
FOR COLLEGE/UNIVERSITY STUDENTS 

INTRODUCTION 

The decade of the 1980s witnessed an unprecedented
recognition of the vital role of listening in all aspects of a
person's life. The failure to listen effectively has been
advanced as the primary reason for all kinds of problems, ranging
from relational problems to mistakes and inefficiencies in the
workplace (Steil, Barker, & Watson, 1983; Wolff, Marsnik, Tacey,
& Nichols, 1983; Wolvin & Coakley, 1992).

The 96-111 "Fundamentals of Speech Communication" course is
required of all students, and it is charged with the
responsibility of ensuring that upon completion of the course
students will possess those basic speaking and listening skills
necessary to successfully complete their college education and to
perform as effective communicators in their careers. A major
mission of the university involves teacher preparation. Those of
us in the Communication Department were particularly interested
in a Department of Public Instruction rule adopted in 1987 which
required "Demonstrated proficiency in speaking and listening as
determined by the institution (preparing teachers)."

As a result of the conditions described above, the 96-111
instructional staff took the initiative to strengthen the
listening dimension of the course. First, we identified the
content to be taught and some learning activities and exercises
designed to develop listening skills. Next, we searched for a
way to assess the skills. After reviewing the available
instruments, we selected the video version of the Watson-Barker
Listening Test as best meeting our immediate need. We
discovered, however, that even with some obvious strengths of the
test, we still were in need of a more appropriate and effective
instrument to assess students in our basic course. So two of the
staff were encouraged to develop a listening test suitable for
our use. Three years later, this test became a reality and was
named the University of Wisconsin-Oshkosh Listening Test.

The purpose of this report is to review research in test
development, listening test development in particular; explain
how we used this research in the development of our test; and
report what we have learned regarding the properties and
performance of the test.

VALIDITY 

The first concern in the development of a test is its
validity: does it indeed measure what it purports to measure?
We have assessed the validity of the test by three procedures
described by Smith: the content procedure, the predictive
procedure, and the known-groups method (Smith, 1988).

Content validity, sometimes called face validity, asks if
the instrument measures a representative sample of the skills
that comprise effective listening. This sample should be
consistent with listening literature in general and the textbook
for the course, which was Communication Works, 3rd edition, by
Gamble and Gamble. A study of both sources shows that
comprehension, defined as "...to understand the message in order
to retain, recall, and - possibly - use the information at a
later time" (Wolvin & Coakley, 1992), is the most basic purpose of
listening. Early listening tests such as the Brown-Carlson and
the STEP tests focused on comprehension. Thus, most of the
questions should address comprehension, which is true of 38 of
the 55 questions (69%) on the test.

Another purpose of listening is called critical or
evaluative listening. The critical listener evaluates what is
heard on the basis of sound logic or reasoning (Brownell, 1986).
While not used as frequently as basic comprehension, critical
listening is recognized by many experts as an important listening
purpose in the widespread attention now being given to the
development of critical thinking skills across all higher
education curricula. Thirteen of the questions (24%) involve
this kind of listening.

A final purpose of listening as explained in our textbook
and taught in the course is empathic listening. Previous
listening tests have not attempted to directly assess this kind,
but it has been addressed in listening literature for some time
(Wolff, Marsnik, Tacey, & Nichols, 1983), and has received
increased attention in recent years (Bruneau, 1989; Thomlison,
1990). It is sometimes treated as part of therapeutic listening
(Wolvin & Coakley, 1992). Consequently, four of the fifty-five
questions (7%) address empathic listening. (A list of the
questions and what each one tests appears on the last page of
this report.)
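
The proportions reported for the three listening purposes can be
rechecked with a few lines of arithmetic. The following Python
sketch uses only the counts stated in the text (38, 13, and 4 of
the 55 questions); the names QUESTION_COUNTS and share are
illustrative and are not part of the test materials.

```python
# Question counts per listening purpose, as stated in the report.
QUESTION_COUNTS = {
    "comprehension": 38,
    "critical/evaluative": 13,
    "empathic": 4,
}
TOTAL = sum(QUESTION_COUNTS.values())


def share(count, total=TOTAL):
    """Return the whole-number percentage that `count` is of `total`."""
    return round(100 * count / total)


for purpose, count in QUESTION_COUNTS.items():
    print(f"{purpose}: {count} of {TOTAL} ({share(count)}%)")
```

The three categories account for all 55 questions, and the rounded
shares agree with the percentages given in the text.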

The second kind of validity study done was predictive
validity, defined as comparing a behavior that is an important
manifestation of the construct being measured with scores on an
instrument designed to measure the same construct. For this
purpose, we compared scores on our test with those on the 1991
Watson-Barker Listening Test (WBLT). The WBLT was developed in
1991 as a revision of the original Watson-Barker test of 1982.
It is viewed as both a training tool and standard testing
instrument (Watson, Barker, Roberts, & Johnson, 1991). Of the
standardized listening tests now available (Rhodes, Watson, &
Barker, 1990), our test appears to match most closely with the
Watson-Barker, which is the only commercially available test that
seeks to test listening skill of college students with a video
format. Certain claims of validity have been reported for the
original video version of the test which was produced in 1987
(Rubin & Roberts, 1987; Watson & Barker, 1988; Roberts, 1988).

Both tests were administered to selected sections of the 96-111
course. The Pearson product-moment correlation coefficient
comparing the scores for a class of 23 was .65 (p<.001), for a
class of 20 it was .61 (p<.01), and for a total of 62 students in
three classes taught by a single instructor it was .61 (p<.001).
This method as an attempt to establish validity has been used for
earlier tests (Applegate & Campbell, 1985; Rubin & Roberts, 1987).
Some experts in testing refer to this technique as supporting
construct validity (Popham, 1990). The assumption underlying this
exercise is that if two tests correlate highly, as was found with
these two tests, whatever validity is present in either is at
least somewhat shared by the other.
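
As an illustration of the calculation behind such coefficients, the
following Python sketch computes a Pearson product-moment
correlation and the t statistic commonly used to test it against
zero. The score lists are hypothetical, not data from the study,
and the function names pearson_r and t_for_r are illustrative.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two equal-length lists with at least 3 pairs")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def t_for_r(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Hypothetical scores for six students on two listening tests.
test_a = [30, 34, 36, 38, 41, 45]
test_b = [28, 33, 31, 37, 40, 42]
r = pearson_r(test_a, test_b)
print(round(r, 3), round(t_for_r(r, len(test_a)), 2))
```

Under this formula a coefficient of .65 in a class of 23 corresponds
to a t of about 3.92 on 21 degrees of freedom.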

The third kind of validity test used, the known groups
method, compares the scores of two groups, one of which is known
to possess higher levels, and one of which possesses lower
levels, of the properties of the construct being tested.
Validity is suggested if the group identified as possessing
higher listening skills performs better on the test than the
group possessing lower skills. The developers of the Kentucky
Comprehensive Listening Test used this method in attempting to
show validity of their instrument. They compared test scores for
three groups: university students, high school students, and army
colonels (Bostrom & Waldhart, 1988).

We used the known groups method in two ways. First, we
administered the test to two classes taught by the same
instructor on the same day near the end of the semester. One was
a regular class; the other was an honors class. It seems
reasonable to assume that one of the unique characteristics of a
group of honors students is that they are better than average
listeners. It is unlikely that a poor listener could become an
honors student. The mean score for the regular class was 34.83,
compared to a significantly higher mean of 39.82 for the honors
class (t=3.10, p<.01). The comparison of these two classes,
therefore, supports test validity.

The second known groups method was used in another way. One
instructor taught a small group communication unit and
administered a test based solely on classroom lectures,
discussions, and activities to three sections of the course.
There was no reading assignment in the unit. The test required
that students remember, understand, and apply principles of small
group communication in a real-life group of which they are a
member. It is reasonable to believe that students possessing
better listening skills would perform better on the test than
students less skilled at listening. Out of the 3 classes, 35
students received either an A or a B on the test. Out of the
same 3 classes, 22 students received either a D or an F. The
mean score on the (Name of the test) for the 22 who received the
D or F on the small group communication test was 36.05, while the
mean score for the 35 who received As or Bs was significantly
higher at 38.91 (t=2.4, p<.05). Thus, both applications of the
known groups method, first with the honors students and secondly
with the high and low small group communication test performers,
supported the validity of the listening test.
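
A known-groups comparison of this kind rests on a two-sample t
test. The report does not state which form of the test was used, so
the following Python sketch assumes a pooled-variance Student's t;
the scores are hypothetical and the function name pooled_t is
illustrative.

```python
import math
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Student's t for two independent groups, assuming equal variances.

    Returns the t statistic and its degrees of freedom.
    """
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / df
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Hypothetical listening scores for a higher-skill and a lower-skill group.
higher_group = [41, 38, 44, 39, 36, 42]
lower_group = [35, 33, 38, 31, 36, 34]
t_value, df = pooled_t(higher_group, lower_group)
print(round(t_value, 2), df)
```

A positive t means the first group scored higher on average; whether
the difference is significant depends on comparing t with the
critical value for the stated degrees of freedom.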

Validity was also promoted by implementing suggestions and
findings reported by listening assessment theorists. One example
of this is the claim that a listening assessment should not
depend on reading and writing skills (Caffrey, 1955; Backlund,
Brown, Gurry, & Jandt, 1982). Many previous tests in
communication have required the student to read and/or write to
the point that reading and writing skill levels may have
contaminated the purported purpose of the test. In this test,
reading skill is not vital to success because everything printed
on the video screen is also presented orally. The only written
response necessary is making marks on a computer-scored answer
sheet.

We noted the advice that the methods of presentation should 
be controlled (Caffrey, 1949), which is best accomplished by 
videotape (Backlund, Brown, Gurry, & Jandt, 1982). Research on 
methods of presentation found that students score significantly 
higher on the Brown-Carlson and STEP listening tests when 
administered by an "effective" speaker than when administered by
an "ineffective" speaker (Barker, Watson, & Kibler, 1984).
Consequently, the (Name of the test) has been placed completely
on videotape, thus controlling for methods of presentation.

Most previous listening tests have used audio-only stimuli. 
But as Roberts points out, "listeners generally do not 'listen' 
with just their ears. Listening typically takes place while the 
listener is hearing and viewing the sender of the message." He
suggests that for a listening test to be useful in terms of
applying results to everyday encounters, the respondent must be
able to respond to a speaker's entire communication code, both
verbal and nonverbal (Roberts, 1988). Consequently, in this
test all messages are presented by people who are seen as well as
heard in various settings.

Validity can also be influenced by the content of the
stimulus material. One suggestion is that the material should be
interesting and meaningful to those taking the test
(Backlund, Brown, Gurry, & Jandt, 1982). A further concern of the
developers was that the test material be of somewhat equal
interest, meaningfulness, and familiarity to test-takers to
reduce the chances of any of these elements giving an advantage
or disadvantage to certain persons. While recognizing that no
single piece of material can totally meet these criteria, we
tried to minimize differences by using material that should at
least be somewhat interesting, meaningful, and familiar to
college/university students. Some of the scenarios present
situations that are oriented uniquely to higher education and
life as a student.

Another suggestion is that a listening test should
differentiate among three kinds of listening as defined partially
by the time between the stimulus and the response. One kind is
called short-term listening, which calls for a response within 15
seconds. Another is short-term with rehearsal, which calls for a
response within 40 seconds. The third is called lecture
listening, where the response comes at least one minute after the
presentation (Bostrom & Waldhart, 1988). The (Name of the test)
includes all three of these types of listening. The most
short-term questions are those that ask for a response to one- or
two-sentence statements that respondents are supposed to identify
as either acceptable or unacceptable examples of evidence or as
either sound or unsound examples of reasoning. The longest
stimulus is a four-minute speech about which 11 questions are
asked.

Finally, if the listening instruction is part of a broader 
communication course, it seems reasonable that some of the 
questions can assume a knowledge of basic communication 
principles. The test includes 17 questions that require some of 
this kind of knowledge to respond correctly. But the inclusion 
of these questions raises a validity question: Is the validity
of the test restricted to students who have received instruction
in basic communication principles? To find the answer to this
question, we administered the test to several sections of the
course prior to basic communication instruction and to several
sections after the instruction. The fact that there was no
significant difference between those tested before and those 
tested after communication instruction suggests that the 
knowledge of communication principles necessary for performing 
well on the test is so basic that this knowledge does not 
influence test performance. The validity of the test, therefore, 
is independent of knowledge of communication principles as taught 
in the course. 

RELIABILITY 

The next major concern in developing an assessment
instrument is its reliability. Reliability, or consistency,
refers to the extent to which individual items on the test
function in the same way.

Two typical kinds of reliability tests were conducted on the 
test: test-retest and the Kuder-Richardson #20 (K-R20). A total 
of 49 students took the test two times within a 10-day period of 
time. A Pearson product-moment correlation coefficient of .68 
(t=4.37, p<.001) was calculated comparing the two sets of scores.

The K-R20 test gives an overall reliability score indicating
the average correlation obtained from all possible split-half
reliabilities (Kuder & Richardson, 1937). For a group of 916
students taking the test in a single semester, the K-R20
reliability score was .67 (t=20.24, p<.001).
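
For readers who wish to compute K-R20 on their own item data, the
following Python sketch applies the standard formula
k/(k-1) * (1 - sum(p*q) / variance of total scores) to a matrix of
right/wrong item scores. The response matrix is hypothetical, the
function name kr20 is illustrative, and the use of the population
variance of total scores is an assumption of this sketch.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    `responses` is a list with one row per student; each row is a list
    of 0/1 item scores (1 = correct).
    """
    n_students = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)  # population variance of total scores
    pq_sum = 0.0
    for item in range(k):
        p = sum(row[item] for row in responses) / n_students  # proportion correct
        pq_sum += p * (1 - p)
    return (k / (k - 1)) * (1 - pq_sum / total_var)


# Hypothetical answers of four students to a three-item test.
sample = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(kr20(sample))
```

The function assumes every row has the same number of items and that
total scores are not all identical, since a zero variance would make
the coefficient undefined.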

Opinions vary in regard to what the reliability of a test
should be to be deemed satisfactory or exemplary. The only item
of agreement is "the higher the better." One expert argues that
a test should have a reliability coefficient of at least .65 to
be considered satisfactory (Cangelosi, 1982). This test meets
the .65 minimum, but does not exceed it by much. Two factors
operate to limit the reliability coefficient of the test compared
to standardized tests boasting of higher figures. One factor is
the relative shortness of this 55-question test; the more
questions included on a test, the greater the potential for high
reliability. Another factor is the relatively homogeneous
population used for the reliability studies. Again, the
potential for higher reliability would be increased by
administering to a more diverse population than found in the
group of students on a single university campus who have taken
the test.
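
The link between test length and reliability can be illustrated
with the Spearman-Brown prophecy formula. This formula is not cited
in the report; it is brought in here only as an illustration, and
it assumes that any added questions are comparable in quality to
the existing ones. The function name spearman_brown is
illustrative.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by `length_factor`.

    A length_factor of 2 means twice as many comparable questions.
    """
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Projection from the reported K-R20 of .67 for the 55-question test.
print(round(spearman_brown(0.67, 2), 2))
```

Under these assumptions, doubling the test to 110 comparable
questions would project a reliability of roughly .80.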

BIAS 

A third subject to be addressed in evaluating a test is 
possible sources of bias. Bias occurs when questions are more
easily or less easily answered because of experience which is 
unique to a particular group. Gender and race are often cited as 
possible sources of bias. 

In regard to gender, bias does not appear to exist to any
significant extent in this test. Women students on our campus
average approximately one more correct answer than men out of the
55 questions asked. While this is not a statistically significant
difference, if the test is used so that a single point difference
determines a student's grade, or whether a student is admitted
into a professional program of study, one more right or wrong
answer can make a profound difference. In this case, a closer
look is warranted to make sure that the additional correct answer
given by a woman is a reflection of her listening ability and not
the result of gender bias in the test.

No claims can be made at present concerning race bias.
Although 3 minority students appear as talent in the test, the
overwhelming majority of the students who have taken the test are
Anglo-Saxon young people, born in this state, between the ages
of 18 and 21. The few minority students who have taken the test
constitute an insufficient number for any analysis of race bias.
Any institution or group using the test with members of a
minority might conduct their own analysis to determine possible
race bias.

ITEM ANALYSIS 

The test has been subjected to item analysis to check for 
difficulty and discriminating power. In developing this test we 
made sure that the questions were difficult enough that scores 
did not cluster at the high end of the scale, but not so 
difficult that they clustered at the low end of the scale. 

The mean item difficulty score for 916 students taking the
test was 63.29. This is a satisfactory score because while it
keeps scores from clustering at the top end, which would reduce
discrimination power, it is not so difficult that students become
demoralized at its difficulty. Out of the total of 208 possible
responses to the 55 questions, practically all of the options
receive at least some "bites" when administered to a class of
20-28 students.

The mean item discrimination score for the 916 students
mentioned above was 22.97. With a score above 20, we are
satisfied with the ability of the test items to discriminate
between highly skilled and less skilled listeners. It should be
noted that the reliability and discrimination scores are somewhat
related, moving up or down together.
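
The two item statistics can be sketched in Python as follows. The
report does not say how its discrimination score was computed, so
the upper-minus-lower index used here (percent correct in the top
scoring group minus percent correct in the bottom group, with 27%
groups by default) is an assumption. The data are hypothetical and
the function names are illustrative.

```python
def item_difficulty(responses, item):
    """Percent of students who answered the given item correctly."""
    return 100 * sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item, fraction=0.27):
    """Upper-minus-lower discrimination index, in percentage points.

    Students are ranked by total score; the top and bottom `fraction`
    of students form the two comparison groups.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item] for row in upper) / group_size
    p_lower = sum(row[item] for row in lower) / group_size
    return 100 * (p_upper - p_lower)


# Hypothetical 0/1 answers of four students to three questions.
answers = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(item_difficulty(answers, 0))
print(item_discrimination(answers, 0, fraction=0.5))
```

Averaging each statistic over all items gives test-level figures of
the kind reported above. A positive discrimination value means the
stronger group answered the item correctly more often than the
weaker group.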

ADMINISTRATION 

The test can be administered successfully to up to 30
students in a single classroom by use of a one-half inch VHS
video playback unit and monitor. Care should be taken to see
that all students are positioned so they can see the picture
clearly on the monitor and hear the audio portion of the tape.
Computer-scored answer sheets providing for at least 4 options to
55 questions can be marked with a pencil and machine scored.

The test takes 45 minutes to administer. Test 
administrators can choose to run straight through the test 
without a break. But unlike pencil and paper tests where 
students can look ahead at questions, because this test is on
videotape, breaks can be taken at any time, or the test can be
administered in parts in as many blocks of time as desired. In
fact, because of the sustained concentration necessary in taking
the test, even a pause for a few seconds some time during the
test is recommended. 

NORMS 

Of 916 students tested, the scores ranged from a low of 15 
to a high of 52. The mean score was 34.77 with a standard 
deviation of 5.55. The mode was 36. 

The percentile ranks corresponding to raw scores are shown
below:

Percentile    Raw score

    90           42
    80           39
    70           38
    60           36
    50           35
    40           34
    30           32
    20           30
    10           28

These figures show that a score of 42, for example, is
higher than 90% of the total scores. Likewise, a score of 30 is
higher than 20% of the total scores.
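
For scoring purposes, the norms can be applied with a simple
lookup. The following Python sketch uses only the percentile
cutoffs, mean, and standard deviation reported in this section; the
names NORMS, percentile_band, and z_score are illustrative, and the
standard score is an added convenience rather than something
reported here.

```python
# (percentile, raw score) pairs from the norms table, highest first.
NORMS = [
    (90, 42), (80, 39), (70, 38), (60, 36), (50, 35),
    (40, 34), (30, 32), (20, 30), (10, 28),
]


def percentile_band(raw_score):
    """Highest tabled percentile whose raw-score cutoff the student reached.

    Returns None for scores below the 10th-percentile cutoff.
    """
    for percentile, cutoff in NORMS:
        if raw_score >= cutoff:
            return percentile
    return None


def z_score(raw_score, mean=34.77, sd=5.55):
    """Distance of a raw score from the reported mean, in standard deviations."""
    return (raw_score - mean) / sd


print(percentile_band(42), percentile_band(30), round(z_score(42), 2))
```

Because the table lists only every tenth percentile, the lookup
reports a band rather than an exact rank: a raw score of 37, for
instance, falls in the band that begins at the 60th percentile.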

CONCLUSION 

To summarize this report, first, we explained the need we
experienced for a standardized listening test. Then we
identified some of the literature in test construction,
especially relating to listening assessment. Next, we explained
how we applied information in the literature to the development
of the test. Finally, we reported data on the nature and
effectiveness of the test as an instrument to assess the
listening skills of students in our basic speech communication
course.




UW-Oshkosh Listening Test - Profile of Questions

Column headings: Question Number; Listening to Comprehend;
Listening Critically/Evaluatively; Listening Empathically;
Recognizing Comm. Principles

Stimulus segment                      Legible question numbers

Four-Minute Speech                    1-11
Student Council Announcement          12-15
Open Meeting                          16-17
[illegible] Announcement              19-21
Directions for Chest X-Ray            23-24
Description of a State Park           26-29
Use of Room for Weekend               31-33
Bad Relationship With Third Party     35-38
Problem Taking Test                   39-41
Problem With Grade on Paper           42-45
Job Interviews                        46-48
Assessing Use of Evidence             50-52
Assessing Use of Reasoning            53-55

[The X marks assigning each question to a column are illegible in
the source.]